Developer CD Series 1993 March: Other People's Memory




Text File | 1992-11-18 | 56.2 KB | 1,528 lines | [TEXT/MPS ]

C.S.M.P. Digest             Sun, 15 Mar 92       Volume 1 : Issue 18

Today's Topics:

        Determining size of file
        Opening docs in non-app-event-aware apps--SOLUTION
        Fatest code to fill memory?


The Comp.Sys.Mac.Programmer Digest is moderated by Michael A. Kelly.

These digests are available (by using FTP, account anonymous, your email
address as password) in the pub/mac/csmp-digest directory on
ftp.cs.uoregon.edu (try skinner.cs.uoregon.edu if that doesn't work). This
is also the home of the comp.sys.mac.programmer Frequently Asked Questions
list.

These digests are also available via email. Just send a note saying that you
want to be on the digest mailing list to mkelly@cs.uoregon.edu, and you will
automatically receive each new digest as it is created.

The articles in these digests are taken directly from
comp.sys.mac.programmer. They are not edited; all articles included in this
digest are in their original posted form. The only articles that are -not-
included in these digests are those which didn't receive any replies (except
those that give information rather than ask a question). All replies to each
article are concatenated onto the original article in the order in which
they were received. Article threads are not added to the digests until the
last article added to the thread is at least one month old (this is to
ensure that the thread is dead before adding it to the digests).

Send administrative mail to mkelly@cs.uoregon.edu.

-------------------------------------------------------

From: crow@ccwf.cc.utexas.edu (David L. Crow)
Subject: Determining size of file
Date: 9 Feb 92 01:51:43 GMT
Organization: The University of Texas at Austin, Austin TX

I am trying to write a program that can determine the size of a file. I am
using Think C 5.0 with System 7. In UNIX, I would use the "stat" subroutine,
but I am a little unsure how to get this info on the Mac. I have been trying
to use FSOpen and PBHGetFInfo, but they aren't working as I would suspect
from reading Inside Mac.
Are these the right routines to use? Is there a better way? Maybe using the
ANSI library instead of the Toolbox?

Thanks!
--
David L. Crow        crow@ccwf.cc.utexas.edu

- -------------------------

From: mcmath@csb1.nlm.nih.gov (Chuck McMath)
Subject: Determining size of file
Date: 10 Feb 92 12:53:32 GMT
Organization: MSD

In article <66447@ut-emx.uucp>, crow@ccwf.cc.utexas.edu (David L. Crow) writes:
>
>
> I am trying to write a program that can determine the size of a file. I
> am using Think C 5.0 with System 7. In UNIX, I would use the "stat" sub-
> routine, but I am a little unsure how to get this info on the Mac. I have
> been trying to use FSOpen and PBHGetFInfo, but they aren't working as I
> would suspect from reading Inside Mac. Are these the right routines to
> use? Is there a better way? Maybe using the ANSI library instead of the
> Toolbox?
>
> Thanks!
> --
> David L. Crow        crow@ccwf.cc.utexas.edu
>
>

Open the file, then call (Pascal syntax):

        err := GetEOF(refNum, logEOF);

logEOF === 'logical end-of-file' === size of file in bytes.

Inside Mac Volume II, pages 93,112.

Cheers!
chuck

--chuck mcmath- mcmath@csb1.nlm.nih.gov
MSD, Inc. * National Library of Medicine * National Institutes of Health
Bethesda, MD 20894

- -------------------------

From: keith@Apple.COM (Keith Rollin)
Subject: Determining size of file
Date: 10 Feb 92 21:25:26 GMT
Organization: Apple Computer Inc., Cupertino, CA

In article <66447@ut-emx.uucp> crow@ccwf.cc.utexas.edu (David L. Crow) writes:
>
> I am trying to write a program that can determine the size of a file. I
> am using Think C 5.0 with System 7. In UNIX, I would use the "stat" sub-
> routine, but I am a little unsure how to get this info on the Mac. I have
> been trying to use FSOpen and PBHGetFInfo, but they aren't working as I
> would suspect from reading Inside Mac. Are these the right routines to
> use? Is there a better way? Maybe using the ANSI library instead of the
> Toolbox?
>

Use PBHGetFInfo or PBGetCatInfo.
You don't need to futz with FSOpen.

Using the built-in File Manager calls is the best way. Using ANSI calls
usually has the problem of having to translate themselves into the built-in
calls. This takes longer, and runs the risk of losing something in the
translation.

--
- ----------------------------------------------------------------------------
Keith Rollin --- <Taligent .signature under construction>
Disclaimer: Pretty soon, I really _won't_ be speaking for Apple...

- -------------------------

From: oster@well.sf.ca.us (David Phillip Oster)
Subject: Determining size of file
Date: 13 Feb 92 13:59:31 GMT
Organization: Whole Earth 'Lectronic Link, Sausalito, CA

In article <62649@apple.Apple.COM> keith@Apple.COM (Keith Rollin) writes:
>Use PBHGetFInfo or PBGetCatInfo. You don't need to futz with FSOpen.

Don't you have to check the file system type before you can use these calls?
I believe that "bad things will happen" if you try to use these on a MFS
file system. At least FSOpen() doesn't care whether the input is MFS or HFS.
--
-- David Phillip Oster - At least the government doesn't make death worse.
-- oster@well.sf.ca.us = {backbone}!well!oster

- -------------------------

From: jcav@quads.uchicago.edu (JohnC)
Subject: Determining size of file
Date: 14 Feb 92 21:50:36 GMT
Organization: The Royal Society for Putting Things on Top of Other Things

In article <29997@well.sf.ca.us> oster@well.sf.ca.us (David Phillip Oster) writes:
>In article <62649@apple.Apple.COM> keith@Apple.COM (Keith Rollin) writes:
>>Use PBHGetFInfo or PBGetCatInfo. You don't need to futz with FSOpen.
>Don't you have to check the file system type before you can use these calls?
>I believe that "bad things will happen" if you try to use these on a MFS
>file system. At least FSOpen() doesn't care whether the input is MFS or HFS.

Well, if HFS is running then the worst that can happen is that you'll get a
"wrongVolType" error.
Of course, if HFS is _not_ running then you'll get a system error when you
try to call _PBGetCatInfo.

--
John Cavallino                  |  EMail: jcav@midway.uchicago.edu
University of Chicago Hospitals |         John_Cavallino@uchfm.bsd.uchicago.edu
Office of Facilities Management | USMail: 5841 S. Maryland Ave, MC 0953
B0 f++ c+ g+ k s+(+) e+ h- pv   |         Chicago, IL 60637

---------------------------

From: greeny@top.cis.syr.edu (Jonathan Greenfield)
Subject: Opening docs in non-app-event-aware apps--SOLUTION
Organization: CIS Dept., Syracuse University
Date: Sat, 8 Feb 92 21:28:01 EST

Thanks to the several people who tried to help me with this problem. It was
actually quite frustrating to have people telling me "just send an 'odoc'
event, and the Process Manager will take care of it" since I had already
tried doing this.

I eventually decided to mess around and see if there was some kind of
"trick" that was required. What I discovered is that (for some reason that is
unknown to me, and apparently undocumented) you have to await a reply when
you send the 'odoc' event, in order for the Process Manager to properly
open the document.

Perhaps DTS should put this information in one of their "snippets" or
something like that. (Or perhaps they already have, though I'm not aware
of it...)

In any case, the application involved, LaunchPad, was just submitted to the
sumex archive, and should be available soon. It is a very simple application
that allows Clean-Desktop-types to clear off their desktops, and still have
drag-and-drop access to a lot of different applications. It's freeware, so
try it out, and let me know what you think.

--
J. S. Greenfield                                  greeny@top.cis.syr.edu
(I like to put 'greeny' here, but my d*mn system wants a *real* name!)

"What's the difference between an orange?"

- -------------------------

From: grobbins@Apple.COM (Grobbins)
Subject: Opening docs in non-app-event-aware apps--SOLUTION
Date: 11 Feb 92 09:01:42 GMT
Organization: Apple CTS

In article <1992Feb8.212801.7577@newstand.syr.edu> greeny@top.cis.syr.edu (Jonathan Greenfield) writes:
>What I discovered is that (for some reason that is
>unknown to me, and apparently undocumented) you have to await a reply when
>you send the 'odoc' event, in order for the Process Manager to properly
>open the document.

What you have to do is call WaitNextEvent, since events don't get sent
until WNE time. Using kAEWaitReply makes the Apple Event manager call
WaitNextEvent for you, as mentioned on page 6-60 of Inside Mac VI.

>Perhaps DTS should put this information in one of their "snippets" or
>something like that. (Or perhaps they already have, though I'm not aware
>of it...)

The latest information from DTS is in the Tech Notes and the Q&A Stack.

Grobbins                grobbins@apple.com

- -------------------------

From: greeny@top.cis.syr.edu (Jonathan Greenfield)
Subject: Opening docs in non-app-event-aware apps--SOLUTION
Date: 13 Feb 92 18:05:13 GMT
Organization: CIS Dept., Syracuse University

In article <62679@apple.Apple.COM> grobbins@Apple.COM (Grobbins) writes:
>In article <1992Feb8.212801.7577@newstand.syr.edu> greeny@top.cis.syr.edu (Jonathan Greenfield) writes:
>>What I discovered is that (for some reason that is
>>unknown to me, and apparently undocumented) you have to await a reply when
>>you send the 'odoc' event, in order for the Process Manager to properly
>>open the document.
>
>What you have to do is call WaitNextEvent, since events don't get sent
>until WNE time. Using kAEWaitReply makes the Apple Event manager call
>WaitNextEvent for you, as mentioned on page 6-60 of Inside Mac VI.

This is not a sufficient explanation, since the Apple event is properly
posted, without any trouble, as long as the 'odoc' does not have to be
converted to "puppet strings."
Since the AE posting behavior differs depending upon whether or not
conversion to "puppet strings" is necessary, I can only conclude that the
need to wait is due to the Process Manager's need to make the conversion,
and not due to the Event Manager's method for posting the event.

--
J. S. Greenfield                                  greeny@top.cis.syr.edu
(I like to put 'greeny' here, but my d*mn system wants a *real* name!)

"What's the difference between an orange?"

---------------------------

From: taihou@iss.nus.sg (Tng Tai Hou)
Subject: Fatest code to fill memory?
Organization: Institute of Systems Science, NUS, Singapore
Date: Mon, 10 Feb 1992 15:43:46 GMT

Can anyone recommend sample 'c' or 680x0 assembly code to fill
contiguous memory with some value? For example:

int i;
Ptr p;

p = baseAddr;
for (i=0; i<100; i++)
        *p++ = color;

Is this the best code? I use ThinkC 5.0. Would appreciate all kinds of
answer on this newsgroup.

My little code segment is for a FTQD (Faster Than QuickDraw) 8-bit line
drawing routine, and also a convex polygon fill routine.

Thanks in advance.

Tai Hou
Singapore

- -------------------------

From: CXT105@psuvm.psu.edu (Christopher Tate)
Subject: Fatest code to fill memory?
Date: 10 Feb 92 21:32:15 GMT
Organization: Penn State University

In article <1992Feb10.154346.23488@nuscc.nus.sg>, taihou@iss.nus.sg (Tng Tai
Hou) says:
>
>Can anyone recommend sample 'c' or 680x0 assembly code to fill
>contiguous memory with some value? For example:
>
>int i;
>Ptr p;
>
>p = baseAddr;
>for (i=0; i<100; i++)
>        *p++ = color;

Going with straight assembly is your best option for a highly-optimized
application like fast graphics.
Try something like:

asm {
        movea   baseAddr, a0    /* equivalent to Ptr p above */
        move    100, d0         /* loop counter */
        move.l  0x1C1C1C1C,d1   /* assuming 8-bit pixels set to hex 1C */
@1:     move.l  d1,(a0)+        /* write 4 pixels */
        dbeq    d0, @1          /* decrement d0, branch to @1 if non-zero */
}

This code is really icky, but the important part is that it writes 4
pixels per access (assuming 8-bit pixels), and uses the DBEQ instruction
for speed. Caches will love that tight loop.

Caveats are that it also assumes that you're moving a multiple of 4
pixels -- if you aren't, then you'll have to adjust the limits of the
loop accordingly, and special-case the last three or fewer pixels.
Also, it assumes that the base address is word aligned. If it isn't
(e.g. there's an extra pixel at the beginning of a scan line), you'll
have to special case that, too.

Setting D1 to be the proper value without being able to assume that the
value is a constant is a little more complex, but not really hard. It's
left as an exercise to the reader. :-)

- -----
Christopher Tate                 | Cryptogram #7:
cxt105@psuvm.psu.edu             |
CXT105@PSUVM.BITNET              | Z XZG AYRPOVR LVTLPYTW
- -------------------------------| YL IYQW TYDPR.
Send me the answer; I love mail! |

- -------------------------

From: jesjones@milton.u.washington.edu (Jesse Jones)
Subject: Fatest code to fill memory?
Organization: University of Washington, Seattle
Date: Tue, 11 Feb 1992 00:36:54 GMT

Chris Tate is right: assembly code is definitely the best way to go if
you want your graphic routines to run as fast as possible. The code he has
is fine as long as you remember that the decrement and branch instructions
are restricted to word length counter registers.

I wrote an assembly routine a while back that fills an arbitrary amount
of memory with words. If you're stuffing longwords you can speed this up
some. The routine is in Modula-2 and in the form of an inline procedure.
The machine language was generated using a DA called Quik Hex (which I
highly recommend).

PROCEDURE FillMem (adr, bytes: ADDRESS; filler: WORD);
   INLINE2(321FH),    (* MOVE.W (A7)+, D1 *)
   INLINE2(241FH),    (* MOVE.L (A7)+, D2 *)
   INLINE2(201FH),    (* MOVE.L (A7)+, D0 *)
   INLINE2(0A055H),   (* StripAddress *)
   INLINE2(2040H),    (* MOVE.L D0, A0 *)
   INLINE2(30C1H),    (* MOVE.W D1, (A0)+ *)
   INLINE2(5582H),    (* SUBQ.L #2, D2 *)
   INLINE2(6EFAH);    (* BGT.S -4 *)

--Jesse

- -------------------------

From: neeri@iis.ethz.ch (Matthias Ulrich Neeracher)
Subject: Fatest code to fill memory?
Date: 11 Feb 92 16:43:44 GMT
Organization: Integrated Systems Laboratory, ETH, Zurich

In article <92041.163215CXT105@psuvm.psu.edu> Christopher Tate <CXT105@psuvm.psu.edu> writes:
>In article <1992Feb10.154346.23488@nuscc.nus.sg>, taihou@iss.nus.sg (Tng Tai
>Hou) says:
>>
>>Can anyone recommend sample 'c' or 680x0 assembly code to fill
>>contiguous memory with some value? For example:
>>
>>int i;
>>Ptr p;
>>
>>p = baseAddr;
>>for (i=0; i<100; i++)
>>        *p++ = color;
>
>Going with straight assembly is your best option for a highly-optimized
>application like fast graphics. Try something like:
>
>asm {
>        movea   baseAddr, a0    /* equivalent to Ptr p above */
>        move    100, d0         /* loop counter */
>        move.l  0x1C1C1C1C,d1   /* assuming 8-bit pixels set to hex 1C */
>@1:     move.l  d1,(a0)+        /* write 4 pixels */
>        dbeq    d0, @1          /* decrement d0, branch to @1 if non-zero */
>}
>
>This code is really icky, but the important part is that it writes 4
>pixels per access (assuming 8-bit pixels), and uses the DBEQ instruction
>for speed. Caches will love that tight loop.

But you probably can do better than that by unrolling the inner loop by a
factor of 8, resulting in:

@1      MOVE.L  D1, (A0)+
        MOVE.L  D1, (A0)+
        MOVE.L  D1, (A0)+
        MOVE.L  D1, (A0)+
        MOVE.L  D1, (A0)+
        MOVE.L  D1, (A0)+
        MOVE.L  D1, (A0)+
        MOVE.L  D1, (A0)+
        DBEQ    D0, @1

>Caveats are that it also assumes that you're moving a multiple of 4
>pixels -- if you aren't, then you'll have to adjust the limits of the
>loop accordingly, and special-case the last three or fewer pixels.
>Also, it assumes that the base address is word aligned. If it isn't
>(e.g. there's an extra pixel at the beginning of a scan line), you'll
>have to special case that, too.

My above code is even more tricky in this respect, and you have to ask
yourself whether the work is justified.

For an interesting study in loop unrolling, take a look at Apple's
implementation of _BlockMove.

Matthias

- ---
Matthias Neeracher                                      neeri@iis.ethz.ch
  `We say "gestalt" when things combine to act in ways we can't explain'
                     -- Marvin Minsky, _The Society Of Mind_

- -------------------------

From: Michael_Hecht@mac.sas.com (Michael Hecht)
Subject: Fatest code to fill memory?
Date: 11 Feb 92 16:32:15 GMT
Organization: SAS Institute Inc.

In article <NEERI.92Feb11104344@iis.ethz.ch>, neeri@iis.ethz.ch (Matthias Ulrich Neeracher) writes:
>
> In article <92041.163215CXT105@psuvm.psu.edu>
> Christopher Tate <CXT105@psuvm.psu.edu> writes:
> >
> >In article <1992Feb10.154346.23488@nuscc.nus.sg>,
> > taihou@iss.nus.sg (Tng Tai Hou) says:
> >>
> >>Can anyone recommend sample 'c' or 680x0 assembly code to fill
> >>contiguous memory with some value? For example:
> >>
> >>[simple byte fill loop example deleted]
> >
> >Going with straight assembly is your best option for a highly-optimized
> >application like fast graphics. Try something like:
> >
> >[longword assembler loop example deleted]
> >
> >[...] it writes 4
> >pixels per access (assuming 8-bit pixels), and uses the DBEQ instruction
> >for speed. Caches will love that tight loop.
>
> But you probably can do better than that by unrolling the inner loop by a
> factor of 8, resulting in:
>
>[unrolled assembler loop example deleted]
>
> >Caveats are that it also assumes that you're moving a multiple of 4
> >pixels. [...] Also, it assumes that the base address is word aligned. [...]
>
> My above code is even more tricky in this respect, and you have to ask yourself
> whether the work is justified.

Here's some code I wrote a while back.
It "unrolls" the loop by using the movem instruction to fill memory in
28-byte chunks. This code works in THINK C 4; I haven't yet checked that
it also works with THINK C 5. However, the THINK C 5 manual states that
register assignment is disabled for any function containing asm statements.

Note that all the caveats mentioned above are handled here.

--Michael

=======================================================================
Michael P. Hecht                  | Internet: Michael_Hecht@mac.sas.com
SAS Institute Inc.; Cary, NC USA  | AppleLink: SAS.HECHT
=======================================================================

/* Number of registers we can use */
#define NREGS   7

/* Number of bytes we can fill in one chunk */
#define AMOUNT  NREGS*sizeof(long)

/* Fill memory starting at p for len bytes with c */
void FillMem( char *p, short len, char c )
{
    /* Use one address register for q */
    register char * q;

    /* Use five data registers... */
    register long r1, r2, r3, r4, r5;

    /* ...and remaining two address registers for filling */
    register char *r6, *r7;

    /* Sanity check for len */
    if( len <= 0 )
        return;

    /* Replicate character to all four bytes of r1 */
    r1 = c & 0xFF;
    r1 |= r1 << 8;
    r1 |= r1 << 16;

    /* Fill all registers with fill character */
    r2 = r3 = r4 = r5 = r1;
    r6 = r7 = ( char * )r1;

    /* Align p on a long-word address */
    while(( long )p & 0x00000003 )
    {
        *p++ = r1;
        if( !( --len ))
            break;
    }

    /*
     * Fill as many full chunks as possible.
     *
     * We have to use the predecrement mode, because the
     * 680x0 doesn't allow movem'ing registers to memory
     * in postincrement mode (that's only allowed in the
     * opposite direction).
     */
    q = p + AMOUNT;
    p += len;
    for( ; q < p; q += 2*AMOUNT )
        asm { movem.l r1/r2/r3/r4/r5/r6/r7,-(q) }

    /* Fill any leftover partial chunk, a byte at a time */
    q -= AMOUNT;
    for( ; q < p; )
        asm { move.b r1,(q)+ }
}

- -------------------------

From: suitti@ima.isc.com (Stephen Uitti)
Subject: Fatest code to fill memory?
Organization: Interactive Systems, Cambridge, MA 02138-5302
Date: Tue, 11 Feb 1992 18:06:44 GMT

In article <92041.163215CXT105@psuvm.psu.edu> Christopher Tate <CXT105@psuvm.psu.edu> writes:
>In article <1992Feb10.154346.23488@nuscc.nus.sg>, taihou@iss.nus.sg (Tng Tai
>Hou) says:
>>Can anyone recommend sample 'c' or 680x0 assembly code to fill
>>contiguous memory with some value? For example:
>>
>>int i;
>>Ptr p;
>>
>>p = baseAddr;
>>for (i=0; i<100; i++)
>>        *p++ = color;
>
>Going with straight assembly is your best option for a highly-optimized
>application like fast graphics. Try something like:
>
>asm {
>        movea   baseAddr, a0    /* equivalent to Ptr p above */
>        move    100, d0         /* loop counter */
>        move.l  0x1C1C1C1C,d1   /* assuming 8-bit pixels set to hex 1C */
>@1:     move.l  d1,(a0)+        /* write 4 pixels */
>        dbeq    d0, @1          /* decrement d0, branch to @1 if non-zero */
>}

Note: dbeq checks the condition codes first. You need to jump over
the move at @1: for the first loop. Also, the code is not good for
clearing to zeros, since the condition code will be set to exit
the loop on the first pass. See code for "amemset", below. When
it all comes down, I couldn't get dbeq to work right.

I did this in Think C 5.02, on a Mac IIci with cache board. In the
options, under Code Optimization, I had "Honor 'register' first" set.

Note: all of these routines expect "count" to mean "this many long
words", not "this many bytes".

/* This is the code that I'd use first */
/* Assume long aligned "buf" */
void cmemset(register long *buf, register long value,
        register unsigned long count)
{
        do {
                *buf++ = value;
        } while (--count);
}

/* This compiles to the same as the above code */
void bmemset(register long *buf, register long value,
        register unsigned long count)
{
        asm {
        @1:     move.l  value,(buf)+    /* do { *buf++ = value; */
                subq.l  #1, count       /* } while (--count); */
                bne.s   @1
        }
}

/* This is how to use "dbeq" correctly.
 * Note that it will exit early (after the first move.l)
 * if "value" is zero.
 * This is, of course, wrong.
 */
void amemset(register long *buf, register long value,
        register unsigned long count)
{
        asm {
                bra.s   @2              /* dbeq checks first */
        @1:     move.l  value,(buf)+    /* do { *buf++ = value; */
        @2:     dbeq    count, @1       /* } while (--count); */
        }
}

Note that the thinkC "memset" ANSI routine uses one move.b per
loop. It uses "asm", but it would have been the same code in C.

For my money, I'd let the C compiler do the linkage work when
doing assembly. Let it figure out where to put variables. Let
it do the C calling sequence. Just check the work it did with
the Disassemble command.

Finally, let's examine Duff's device for loop unrolling.

void dmemset(register long *buf, register long value,
        register unsigned long count)
{
        register unsigned long loop;

        loop = (count + 8 - 1) >> 3;
        switch (count & (8 - 1)) {
        case 0: do {    *buf++ = value;
        case 7:         *buf++ = value;
        case 6:         *buf++ = value;
        case 5:         *buf++ = value;
        case 4:         *buf++ = value;
        case 3:         *buf++ = value;
        case 2:         *buf++ = value;
        case 1:         *buf++ = value;
                } while (--loop);
        }
}

This compiles a nicely unrolled loop that will handle any number
of longwords. The switch gets you into the loop at the right spot
for the first loop, and then the loop fills 8 longwords at a time
until it is done. The loop overhead is 1/8th as much. Duff's
device looks gross, but remember that a switch is really just a
computed "goto", and the "case"'s are just labels. I believe
this hack is blessed by ANSI C.

Think C uses "subq.l"/"bne.s", rather than "dbeq", so if you
wanted that too, you'd have to code it in assembly. Though
"Disassemble" can help here, it is further complicated by the
fact that "dbeq" checks the condition codes first. "dbeq" is an
instruction that "check condition codes for zero, if nonzero
subtract one, if not -1 then branch". The instruction I really
wanted was the PDP-11 "sob" - "subtract one and branch if not
zero".
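
[Editor's sketch, not part of the original post.] The dbeq behaviour
described above can be modelled in portable C to see why a zero fill stops
early. The names dbeq_step and fill_dbeq are made up for this illustration.
The model tracks only the Z flag and the low 16 bits of the counter, and it
assumes Z is clear on entry; it is a rough model of the instruction as
described in this thread, not Think C output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one "dbeq Dn,label" step.
 * Returns 1 if the branch back to the label is taken, 0 if execution
 * falls through. z_flag models the Z condition code left by the previous
 * instruction; *counter models the low word of Dn. */
int dbeq_step(int z_flag, uint16_t *counter)
{
    if (z_flag)                 /* EQ is true: leave the loop, no decrement */
        return 0;
    *counter = (uint16_t)(*counter - 1);
    return *counter != 0xFFFF;  /* branch unless the word counter reached -1 */
}

/* Model of the amemset loop: bra to the dbeq, then move.l / dbeq.
 * buf must have room for at least "count" longwords.
 * Returns the number of longwords actually stored. */
size_t fill_dbeq(uint32_t *buf, uint32_t value, uint16_t count)
{
    size_t stored = 0;
    int z = 0;                  /* assumption: Z is clear on entry */

    assert(buf != NULL);
    while (dbeq_step(z, &count)) {
        buf[stored++] = value;  /* move.l value,(buf)+ */
        z = (value == 0);       /* MOVE sets Z when the data moved is zero */
    }
    return stored;
}
```

Under these assumptions, fill_dbeq(buf, 0x40404040, 100) stores 100
longwords, while fill_dbeq(buf, 0, 100) stores only one, which matches the
early exit noted in the amemset comment.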

In fact, given that "dbeq" is so complicated, I wouldn't be
surprised if "subq.l"/"bne.s" weren't faster, or at least nearly
the same speed as "dbeq". I'd say, off hand, that "dbeq" is
useless.

In a short benchmark of the above routines, called with a buffer
100,000 4-byte words long, a value of 0x40404040, and a count of
100,000, with each routine called 1,000 times, the times on a
Mac IIci with cache card were:

Routine  Time in seconds
amemset  22 (dbeq) - however, it didn't seem to work right.
bmemset  78 (subq.l/bne.s but in asm)
cmemset  78 (subq.l/bne.s but in C)
dmemset  78 (unrolled 8 times with subq.l/bne.s at bottom)

dmemset was not faster than bmemset or cmemset. It was not the
speedup I'd have expected. Loop unrolling does not appear to
benefit this code.

On a new topic: One thing I'd wish for in Think C is that
"Disassemble" was completely compatible with "asm". For example,
"Disassemble" produces code such as

        move.l #$01020304,D0

whereas "asm" wants to see

        move.l #01020304,d0

This is just a pain.

Stephen.
suitti@ima.isc.com

- -------------------------

From: d88-jwa@hemul.nada.kth.se (Jon W{tte)
Subject: Fatest code to fill memory?
Date: 11 Feb 92 17:28:42 GMT
Organization: Royal Institute of Technology, Stockholm, Sweden

.ch> neeri@iis.ethz.ch (Matthias Ulrich Neeracher) writes:

>>Can anyone recommend sample 'c' or 680x0 assembly code to fill
>>contiguous memory with some value? For example:

>>p = baseAddr;
>>for (i=0; i<100; i++)
>>        *p++ = color;

>@1:     move.l  d1,(a0)+        /* write 4 pixels */
>        dbeq    d0, @1          /* decrement d0, branch to @1 if non-zero */

   @1      MOVE.L  D1, (A0)+
           MOVE.L  D1, (A0)+
           MOVE.L  D1, (A0)+
           MOVE.L  D1, (A0)+
           MOVE.L  D1, (A0)+
           MOVE.L  D1, (A0)+
           MOVE.L  D1, (A0)+
           MOVE.L  D1, (A0)+
           DBEQ    D0, @1

What's wrong with MOVEM.L ? That should copy 60 bytes per
instruction fetch...

   For an interesting study in loop unrolling, take a look at Apple's
   implementation of _BlockMove.

True.
Especially on 040 ROMs where they move a cache line
each time :-)

(Couldn't you just get the address of BlockMove and call that
directly ? That might be fast enough !)

--
This Signature is distributed under the conditions of the Signature License,
available at a fee from h+@nada.kth.se (Jon W{tte) Reading the Signature
implies that you accept to be bound by the terms in said License. Should you
not agree on any of these terms, you must return the Signature unread to me.

- -------------------------

From: orpheus@reed.edu (P. Hawthorne)
Subject: Fatest code to fill memory?
Date: 12 Feb 92 05:53:24 GMT
Organization: Reed College, Portland OR

I spent too long trying to best BlockMove. I tried just about everything
I could think of. BlockMove is weak from tight inner loops and short moves.
For pure flat out speed on short moves from an inner loop, one can do
better. If you never have overlapping source and destination blocks, it can
be made even faster. Perhaps not surprisingly, it is faster in general to
use a predecrement mode move if possible.

Since the only thing that makes BlockMove weak is that it eats cache,
one's own routines should not. MOVEM is great for the 68000, but the
overhead of saving and restoring registers outweighs the reduced cache
damage for tight loops, where we have a chance of improving on BlockMove.
I was aiming at best performance on 68020/68030 machines, but I suspect
that the MOVEM routine included would compare decently with BlockMove on
68040 machines.

Here're some of the most successful of the routines I was pitting against
BlockMove. Judge me not by their elegance or lack thereof. These reflect
the experimentation process I was undergoing, and not everything I learned.
I very much look forward to any commentary, however, first machine code
project and all... Good, bad, atrocious?

I'll defer the explanation of them until someone specifically asks, since
I expect folks who are interested to be able to read this with little or no
effort.
Formal source will be included with the Panacea Class Library, which I
shall be releasing Real Soon Now.

Incidentally, the fellow who loaned me the m68k series manuals was utterly
nonplussed at the idea of a fast memory copy. He thought it was too simple,
and offered me a tight printf and a multiprocessor FFT.

The general purpose mover I ultimately wrote will first decide if content
overlaps and go to postincrement or predecrement cases appropriately. Those
check to see if the number of characters is small, medium or large. If
large, 4096 chars, then it calls BlockMove. It does not jump straight to
the address of BlockMove, which it ought, I know. I include the Pascal
INLINE for that in the next article. I'm afraid I lost my ResEdit copy.

Oh, one more thing. Please pardon the syntax, these are hand assembled from
machine code within ResEdit's hex editor. (Just don't happen to own MPW Asm)

Theus (orpheus@reed.edu)

Machine code source starts here. Pascal INLINEs in next article:

memcpy 1 byte 0 alignment (note: byte count means bytes before looping)
+0000 000000   MOVE.L    (A7)+,D0                | 201F
+0002 000002   MOVEA.L   (A7)+,A1                | 225F
+0004 000004   MOVEA.L   (A7)+,A0                | 205F
+0006 000006   SUBQ.L    #$1,D0                  | 5380
+0008 000008   BMI.S     memcpy+$0018            | 6B0E
+000A 00000A   MOVE.B    (A0)+,(A1)+             | 12D8
+000C 00000C   DBF       D0,memcpy+$000A         | 51C8 FFFC
+0010 000010   SUBI.L    #$00010000,D0           | 0480 0001 0000
+0016 000016   BGT.S     memcpy+$000A            | 6EF2

memcpy 2 byte 0 alignment
+0000 000000   MOVE.L    (A7)+,D0                | 201F
+0002 000002   MOVEA.L   (A7)+,A1                | 225F
+0004 000004   MOVEA.L   (A7)+,A0                | 205F
+0006 000006   LSR.L     #$1,D0                  | E288
+0008 000008   BCC.S     memcpy+$000E            | 6404
+000A 00000A   MOVE.B    (A0)+,(A1)+             | 12D8
+000C 00000C   TST.L     D0                      | 4A80
+000E 00000E   BEQ.S     memcpy+$0020            | 6710
+0010 000010   SUBQ.L    #$1,D0                  | 5380
+0012 000012   MOVE.W    (A0)+,(A1)+             | 32D8
+0014 000014   DBF       D0,memcpy+$0012         | 51C8 FFFC
+0018 000018   SUBI.L    #$00010000,D0           | 0480 0001 0000
+001E 00001E   BGT.S     memcpy+$0012            | 6EF2

memcpy 8 byte 0 alignment
+0000 000000   MOVE.L    (A7)+,D0                | 201F
+0002 000002   MOVEA.L   (A7)+,A1                | 225F
+0004 000004   MOVEA.L   (A7)+,A0                | 205F
+0006 000006   LSR.L     #$1,D0                  | E288
+0008 000008   BCC.S     memcpy+$000E            | 6404
+000A 00000A   MOVE.B    (A0)+,(A1)+             | 12D8
+000C 00000C   TST.L     D0                      | 4A80
+000E 00000E   BEQ.S     memcpy+$0036            | 6726
+0010 000010   LSR.L     #$1,D0                  | E288
+0012 000012   BCC.S     memcpy+$001A            | 6406
+0014 000014   MOVE.W    (A0)+,(A1)+             | 32D8
+0016 000016   TST.L     D0                      | 4A80
+0018 000018   BEQ.S     memcpy+$0036            | 671C
+001A 00001A   LSR.L     #$1,D0                  | E288
+001C 00001C   BCC.S     memcpy+$0024            | 6406
+001E 00001E   MOVE.L    (A0)+,(A1)+             | 22D8
+0020 000020   TST.L     D0                      | 4A80
+0022 000022   BEQ.S     memcpy+$0036            | 6712
+0024 000024   SUBQ.L    #$1,D0                  | 5380
+0026 000026   MOVE.L    (A0)+,(A1)+             | 22D8
+0028 000028   MOVE.L    (A0)+,(A1)+             | 22D8
+002A 00002A   DBF       D0,memcpy+$0026         | 51C8 FFFA
+002E 00002E   SUBI.L    #$00010000,D0           | 0480 0001 0000
+0034 000034   BGT.S     memcpy+$0026            | 6EF0

memcpy 32 byte 0 alignment
+0000 000000   MOVE.L    (A7)+,D0                | 201F
+0002 000002   MOVEA.L   (A7)+,A1                | 225F
+0004 000004   MOVEA.L   (A7)+,A0                | 205F
+0006 000006   LSR.L     #$1,D0                  | E288
+0008 000008   BCC.S     memcpy+$000E            | 6404
+000A 00000A   MOVE.B    (A0)+,(A1)+             | 12D8
+000C 00000C   TST.L     D0                      | 4A80
+000E 00000E   BEQ.S     memcpy+$005E            | 674E
+0010 000010   LSR.L     #$1,D0                  | E288
+0012 000012   BCC.S     memcpy+$001A            | 6406
+0014 000014   MOVE.W    (A0)+,(A1)+             | 32D8
+0016 000016   TST.L     D0                      | 4A80
+0018 000018   BEQ.S     memcpy+$005E            | 6744
+001A 00001A   LSR.L     #$1,D0                  | E288
+001C 00001C   BCC.S     memcpy+$0024            | 6406
+001E 00001E   MOVE.L    (A0)+,(A1)+             | 22D8
+0020 000020   TST.L     D0                      | 4A80
+0022 000022   BEQ.S     memcpy+$005E            | 673A
+0024 000024   LSR.L     #$1,D0                  | E288
+0026 000026   BCC.S     memcpy+$0030            | 6408
+0028 000028   MOVE.L    (A0)+,(A1)+             | 22D8
+002A 00002A   MOVE.L    (A0)+,(A1)+             | 22D8
+002C 00002C   TST.L     D0                      | 4A80
+002E 00002E   BEQ.S     memcpy+$005E            | 672E
+0030 000030   LSR.L     #$1,D0                  | E288
+0032 000032   BCC.S     memcpy+$0040            | 640C
+0034 000034   MOVE.L    (A0)+,(A1)+             | 22D8
+0036 000036   MOVE.L    (A0)+,(A1)+             | 22D8
+0038 000038   MOVE.L    (A0)+,(A1)+             | 22D8
+003A 00003A   MOVE.L    (A0)+,(A1)+             | 22D8
+003C 00003C   TST.L     D0                      | 4A80
+003E 00003E   BEQ.S     memcpy+$005E            | 671E
+0040 000040   SUBQ.L    #$1,D0                  | 5380
+0042 000042   MOVE.L    (A0)+,(A1)+             | 22D8
+0044 000044   MOVE.L    (A0)+,(A1)+             | 22D8
+0046 000046   MOVE.L    (A0)+,(A1)+             | 22D8
+0048 000048   MOVE.L    (A0)+,(A1)+             | 22D8
+004A 00004A   MOVE.L    (A0)+,(A1)+             | 22D8
+004C 00004C   MOVE.L    (A0)+,(A1)+             | 22D8
+004E 00004E   MOVE.L    (A0)+,(A1)+             | 22D8
+0050 000050   MOVE.L    (A0)+,(A1)+             | 22D8
+0052 000052   DBF       D0,memcpy+$0042         | 51C8 FFEE
+0056 000056   SUBI.L    #$00010000,D0           | 0480 0001 0000
+005C 00005C   BGT.S     memcpy+$0042            | 6EE4

memcpy 256 byte 0 alignment
+0000 000000   MOVE.L    (A7)+,D0                | 201F
+0002 000002   MOVEA.L   (A7)+,A1                | 225F
+0004 000004   MOVEA.L   (A7)+,A0                | 205F
+0006 000006   LSR.L     #$1,D0                  | E288
+0008 000008   BCC.S     memcpy+$000E            | 6404
+000A 00000A   MOVE.B    (A0)+,(A1)+             | 12D8
+000C 00000C   TST.L     D0                      | 4A80
+000E 00000E   BEQ       memcpy+$0118            | 6700 0108
+0012 000012   LSR.L     #$1,D0                  | E288
+0014 000014   BCC.S     memcpy+$001E            | 6408
+0016 000016   MOVE.W    (A0)+,(A1)+             | 32D8
+0018 000018   TST.L     D0                      | 4A80
+001A 00001A   BEQ       memcpy+$0118            | 6700 00FC
+001E 00001E   LSR.L     #$1,D0                  | E288
+0020 000020   BCC.S     memcpy+$002A            | 6408
+0022 000022   MOVE.L    (A0)+,(A1)+             | 22D8
+0024 000024   TST.L     D0                      | 4A80
+0026 000026   BEQ       memcpy+$0118            | 6700 00F0
+002A 00002A   LSR.L     #$1,D0                  | E288
+002C 00002C   BCC.S     memcpy+$0038            | 640A
+002E 00002E   MOVE.L    (A0)+,(A1)+             | 22D8
+0030 000030   MOVE.L    (A0)+,(A1)+             | 22D8
+0032 000032   TST.L     D0                      | 4A80
+0034 000034   BEQ       memcpy+$0118            | 6700 00E2
+0038 000038   LSR.L     #$1,D0                  | E288
+003A 00003A   BCC.S     memcpy+$004A            | 640E
+003C 00003C   MOVE.L    (A0)+,(A1)+             | 22D8
+003E 00003E   MOVE.L    (A0)+,(A1)+             | 22D8
+0040 000040   MOVE.L    (A0)+,(A1)+             | 22D8
+0042 000042   MOVE.L    (A0)+,(A1)+             | 22D8
+0044 000044   TST.L     D0                      | 4A80
+0046 000046   BEQ       memcpy+$0118            | 6700 00D0
+004A 00004A   LSR.L     #$1,D0                  | E288
+004C 00004C   BCC.S     memcpy+$0064            | 6416
+004E 00004E   MOVE.L    (A0)+,(A1)+             | 22D8
+0050 000050   MOVE.L    (A0)+,(A1)+             | 22D8
+0052 000052   MOVE.L    (A0)+,(A1)+             | 22D8
+0054 000054   MOVE.L    (A0)+,(A1)+             | 22D8
+0056 000056   MOVE.L    (A0)+,(A1)+             | 22D8
+0058 000058   MOVE.L    (A0)+,(A1)+             | 22D8
+005A 00005A   MOVE.L    (A0)+,(A1)+             | 22D8
+005C 00005C   MOVE.L    (A0)+,(A1)+             | 22D8
+005E 00005E   TST.L     D0                      | 4A80
+0060 000060   BEQ       memcpy+$0118            | 6700 00B6
+0064 000064   LSR.L     #$1,D0                  | E288
+0066 000066   BCC.S     memcpy+$008E            | 6426
+0068 000068   MOVE.L    (A0)+,(A1)+             | 22D8
+006A 00006A   MOVE.L    (A0)+,(A1)+             | 22D8
+006C 00006C   MOVE.L    (A0)+,(A1)+             | 22D8
+006E 00006E   MOVE.L    (A0)+,(A1)+             | 22D8
+0070 000070   MOVE.L    (A0)+,(A1)+             | 22D8
+0072 000072   MOVE.L    (A0)+,(A1)+             | 22D8
+0074 000074   MOVE.L    (A0)+,(A1)+             | 22D8
+0076 000076   MOVE.L    (A0)+,(A1)+             | 22D8
+0078 000078   MOVE.L    (A0)+,(A1)+             | 22D8
+007A 00007A   MOVE.L    (A0)+,(A1)+             | 22D8
+007C 00007C   MOVE.L    (A0)+,(A1)+             | 22D8
+007E 00007E   MOVE.L    (A0)+,(A1)+             | 22D8
+0080 000080   MOVE.L    (A0)+,(A1)+             | 22D8
+0082 000082   MOVE.L    (A0)+,(A1)+             | 22D8
+0084 000084   MOVE.L    (A0)+,(A1)+             | 22D8
+0086 000086   MOVE.L    (A0)+,(A1)+             | 22D8
+0088 000088   TST.L     D0                      | 4A80
+008A 00008A   BEQ       memcpy+$0118            | 6700 008C
+008E 00008E   MOVEM.L   D1-D7/A2-A6,-(A7)       | 48E7 7F3E
+0092 000092   LSR.L     #$1,D0                  | E288
+0094 000094   BCC.S     memcpy+$00BE            | 6428
+0096 000096   MOVEM.L   (A0)+,D1-D7/A2-A6       | 4CD8 7CFE
+009A 00009A   MOVEM.L   D1-D7/A2-A6,(A1)        | 48D1 7CFE
+009E 00009E   ADDA.W    #$0030,A1               | D2FC 0030
+00A2 0000A2   MOVEM.L   (A0)+,D1-D7/A2-A6       | 4CD8 7CFE
+00A6 0000A6   MOVEM.L   D1-D7/A2-A6,(A1)        | 48D1 7CFE
+00AA 0000AA   ADDA.W    #$0030,A1               | D2FC 0030
+00AE 0000AE   MOVEM.L   (A0)+,D2-D7/A4/A5       | 4CD8 30FC
+00B2 0000B2   MOVEM.L   D2-D7/A4/A5,(A1)        | 48D1 30FC
+00B6 0000B6   ADDA.W    #$0020,A1               | D2FC 0020
+00BA 0000BA   TST.L     D0                      | 4A80
+00BC 0000BC   BEQ.S     memcpy+$0114            | 6756
+00BE 0000BE   SUBQ.L    #$1,D0                  | 5380
+00C0 0000C0   MOVEM.L   (A0)+,D1-D7/A2-A6       | 4CD8 7CFE
+00C4 0000C4   MOVEM.L   D1-D7/A2-A6,(A1)        | 48D1 7CFE
+00C8 0000C8   ADDA.W    #$0030,A1               | D2FC 0030
+00CC 0000CC   MOVEM.L   (A0)+,D1-D7/A2-A6       | 4CD8 7CFE
+00D0 0000D0   MOVEM.L   D1-D7/A2-A6,(A1)        | 48D1 7CFE
+00D4 0000D4   ADDA.W    #$0030,A1               | D2FC 0030
+00D8 0000D8   MOVEM.L   (A0)+,D1-D7/A2-A6       | 4CD8 7CFE
+00DC 0000DC   MOVEM.L   D1-D7/A2-A6,(A1)        | 48D1 7CFE
+00E0 0000E0   ADDA.W    #$0030,A1               | D2FC 0030
+00E4 0000E4   MOVEM.L   (A0)+,D1-D7/A2-A6       | 4CD8 7CFE
+00E8 0000E8   MOVEM.L   D1-D7/A2-A6,(A1)        | 48D1 7CFE
+00EC 0000EC   ADDA.W    #$0030,A1               | D2FC 0030
+00F0 0000F0   MOVEM.L   (A0)+,D1-D7/A2-A6       | 4CD8 7CFE
+00F4 0000F4   MOVEM.L   D1-D7/A2-A6,(A1)        | 48D1 7CFE
+00F8 0000F8   ADDA.W    #$0030,A1               | D2FC 0030
+00FC 0000FC   MOVEM.L   (A0)+,D1-D4             | 4CD8 001E
+0100 000100   MOVEM.L   D1-D4,(A1)              | 48D1 001E
+0104 000104   ADDA.W    #$0010,A1               | D2FC 0010
+0108 000108   DBF       D0,memcpy+$00C0         | 51C8 FFB6
+010C 00010C   SUBI.L    #$00010000,D0           | 0480 0001 0000
+0112 000112   BGT.S     memcpy+$00C0            | 6EAC
+0114 000114   MOVEM.L   (A7)+,D1-D7/A2-A6       | 4CDF 7CFE

End of machine code source. Pascal INLINEs in next article.

- -------------------------

From: orpheus@reed.edu (P. Hawthorne)
Subject: Fatest code to fill memory?
Date: 12 Feb 92 06:20:48 GMT
Organization: Reed College, Portland OR

Pascal INLINE source starts here.

procedure CopyBlockInline;
inline {This is the most general mover included. Does overlaps.}
$226E, $000C, $206E, $0010, $202E, $0008, $2208, $D280, $B289, $5DC1, {}
$B3C8, $5DC2, $8202, $6700, $00D6, $0C80, $0000, $00C4, $6E3A, $E288, {}
$6404, $12D8, $4A80, $6700, $019A, $E288, $6408, $32D8, $4A80, $6700, {}
$018E, $E288, $6408, $22D8, $4A80, $6700, $0182, $5380, $22D8, $22D8, {}
$51C8, $FFFA, $0480, $0001, $0000, $6EF0, $6000, $016C, $0C80, $0000, {}
$0D00, $6E00, $0084, $E288, $6406, $12D8, $4A80, $677A, $E288, $6406, {}
$32D8, $4A80, $6770, $E288, $6406, $22D8, $4A80, $6766, $E288, $6408, {}
$22D8, $22D8, $4A80, $675A, $E288, $640C, $22D8, $22D8, $22D8, $22D8, {}
$4A80, $674A, $E288, $6414, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, {}
$22D8, $22D8, $4A80, $6732, $5380, $22D8, $22D8, $22D8, $22D8, $22D8, {}
$22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, {}
$22D8, $51C8, $FFDE, $0480, $0001, $0000, $6ED4, $6002, $A02E, $6000, {}
$00DA, $0C80, $0000, $00C4, $6E3E, $D1C0, $D3C0, $E288, $6404, $1320, {}
$4A80, $6700, $00C2, $E288, $6408, $3320,
$4A80, $6700, $00B6, $E288, {} $6408, $2320, $4A80, $6700, $00AA, $5380, $2320, $2320, $51C8, $FFFA, {} $0480, $0001, $0000, $6EF0, $6000, $0094, $0C80, $0000, $0D00, $6E00, {} $0088, $D1C0, $D3C0, $E288, $6406, $1320, $4A80, $677A, $E288, $6406, {} $3320, $4A80, $6770, $E288, $6406, $2320, $4A80, $6766, $E288, $6408, {} $2320, $2320, $4A80, $675A, $E288, $640C, $2320, $2320, $2320, $2320, {} $4A80, $674A, $E288, $6414, $2320, $2320, $2320, $2320, $2320, $2320, {} $2320, $2320, $4A80, $6732, $5380, $2320, $2320, $2320, $2320, $2320, {} $2320, $2320, $2320, $2320, $2320, $2320, $2320, $2320, $2320, $2320, {} $2320, $51C8, $FFDE, $0480, $0001, $0000, $6ED4, $6002, $A02E; {} procedure CopyBlock (src, dst: univ Ptr; count: Longint); begin {Glue routine that assumes Think Pascal. Watch your registers.} CopyBlockInline; end; procedure CopyMem (src, dst: univ Ptr; count: Longint); inline $201F, $225F, $205F, $0C80, $0000, $00C4, $6E3A, $E288, $6404, $12D8, {} $4A80, $6700, $00BE, $E288, $6408, $32D8, $4A80, $6700, $00B2, $E288, {} $6408, $22D8, $4A80, $6700, $00A6, $5380, $22D8, $22D8, $51C8, $FFFA, {} $0480, $0001, $0000, $6EF0, $6000, $0090, $0C80, $0000, $0D00, $6E00, {} $0084, $E288, $6404, $12D8, $4A80, $677A, $E288, $6406, $32D8, $4A80, {} $6770, $E288, $6406, $22D8, $4A80, $6766, $E288, $6408, $22D8, $22D8, {} $4A80, $675A, $E288, $640C, $22D8, $22D8, $22D8, $22D8, $4A80, $674A, {} $E288, $6414, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, {} $4A80, $6732, $5380, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, {} $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $51C8, {} $FFDE, $0480, $0001, $0000, $6ED4, $6002, $A02E; {} procedure CopyMemBackwards (src, dst: univ Ptr; count: Longint); inline {Moves memory using predecrement mode} $201F, $225F, $205F, $0C80, $0000, $00C4, $6E3E, $D1C0, $D3C0, $E288, {} $6404, $1320, $4A80, $6700, $00C2, $E288, $6408, $3320, $4A80, $6700, {} $00B6, $E288, $6408, $2320, $4A80, $6700, $00AA, $5380, $2320, 
$2320, {} $51C8, $FFFA, $0480, $0001, $0000, $6EF0, $6000, $0094, $0C80, $0000, {} $0D00, $6E00, $0088, $D1C0, $D3C0, $E288, $6404, $1320, $4A80, $677A, {} $E288, $6406, $3320, $4A80, $6770, $E288, $6406, $2320, $4A80, $6766, {} $E288, $6408, $2320, $2320, $4A80, $675A, $E288, $640C, $2320, $2320, {} $2320, $2320, $4A80, $674A, $E288, $6414, $2320, $2320, $2320, $2320, {} $2320, $2320, $2320, $2320, $4A80, $6732, $5380, $2320, $2320, $2320, {} $2320, $2320, $2320, $2320, $2320, $2320, $2320, $2320, $2320, $2320, {} $2320, $2320, $2320, $51C8, $FFDE, $0480, $0001, $0000, $6ED4, $6002, {} $A02E; {} procedure Copy1 (src, dst: univ Ptr; count: Longint); inline $201F, $225F, $205F, $E288, $6404, $12D8, $4A80, $6710, $5380, $32D8, {} $51C8, $FFFC, $0480, $0001, $0000, $6EF2; {} procedure Copy2 (src, dst: univ Ptr; count: Longint); inline $201F, $225F, $205F, $E288, $6406, $12D8, $4A80, $671A, $E288, $6404, {} $32D8, $4A80, $6710, $5380, $22D8, $51C8, $FFFC, $0480, $0001, $0000, {} $6EF2; {} procedure Copy8 (src, dst: univ Ptr; count: Longint); inline $201F, $225F, $205F, $E288, $6404, $12D8, $4A80, $6736, $E288, $6406, {} $32D8, $4A80, $672C, $E288, $6406, $22D8, $4A80, $6722, $E288, $6408, {} $22D8, $22D8, $4A80, $6716, $5380, $22D8, $22D8, $22D8, $22D8, $51C8, {} $FFF6, $0480, $0001, $0000, $6EEC; {} procedure Copy32 (src, dst: univ Ptr; count: Longint); inline $201F, $225F, $205F, $E288, $6404, $12D8, $4A80, $6776, $E288, $6406, {} $32D8, $4A80, $676C, $E288, $6406, $22D8, $4A80, $6762, $E288, $6408, {} $22D8, $22D8, $4A80, $6756, $E288, $640C, $22D8, $22D8, $22D8, $22D8, {} $4A80, $6746, $E288, $6414, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, {} $22D8, $22D8, $4A80, $672E, $5380, $22D8, $22D8, $22D8, $22D8, $22D8, {} $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, {} $22D8, $51C8, $FFDE, $0480, $0001, $0000, $6ED4; {} procedure Copy64 (src, dst: univ Ptr; count: Longint); inline $201F, $225F, $205F, $E288, $6404, $12D8, $4A80, $6700, $00B2, 
$E288, {} $6408, $32D8, $4A80, $6700, $00A6, $E288, $6408, $22D8, $4A80, $6700, {} $009A, $E288, $640A, $22D8, $22D8, $4A80, $6700, $008C, $E288, $640C, {} $22D8, $22D8, $22D8, $22D8, $4A80, $677A, $E288, $6414, $22D8, $22D8, {} $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $4A80, $6762, $E288, $6424, {} $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, {} $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $4A80, $673A, $48E7, $7F3E, {} $5380, $4CD8, $7CFE, $48D1, $7CFE, $D2FC, $0030, $4CD8, $7CFE, $48D1, {} $7CFE, $D2FC, $0030, $4CD8, $30FC, $48D1, $30FC, $D2FC, $0020, $51C8, {} $FFDA, $0480, $0001, $0000, $6ED0, $4CDF, $7CFE; {} procedure Copy128 (src, dst: univ Ptr; count: Longint); inline $201F, $225F, $205F, $E288, $6404, $12D8, $4A80, $6700, $0108, $E288, {} $6408, $32D8, $4A80, $6700, $00FC, $E288, $6408, $22D8, $4A80, $6700, {} $00F0, $E288, $640A, $22D8, $22D8, $4A80, $6700, $00E2, $E288, $640E, {} $22D8, $22D8, $22D8, $22D8, $4A80, $6700, $00D0, $E288, $6416, $22D8, {} $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $4A80, $6700, $00B6, {} $E288, $6426, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, {} $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $4A80, $6700, {} $008C, $48E7, $7F3E, $E288, $6428, $4CD8, $7CFE, $48D1, $7CFE, $D2FC, {} $0030, $4CD8, $7CFE, $48D1, $7CFE, $D2FC, $0030, $4CD8, $30FC, $48D1, {} $30FC, $D2FC, $0020, $4A80, $6756, $5380, $4CD8, $7CFE, $48D1, $7CFE, {} $D2FC, $0030, $4CD8, $7CFE, $48D1, $7CFE, $D2FC, $0030, $4CD8, $7CFE, {} $48D1, $7CFE, $D2FC, $0030, $4CD8, $7CFE, $48D1, $7CFE, $D2FC, $0030, {} $4CD8, $7CFE, $48D1, $7CFE, $D2FC, $0030, $4CD8, $001E, $48D1, $001E, {} $D2FC, $0010, $51C8, $FFB6, $0480, $0001, $0000, $6EAC, $4CDF, $7CFE; {} procedure Copy128CacheFriendly (src, dst: univ Ptr; count: Longint); inline $201F, $225F, $205F, $E288, $6404, $12D8, $4A80, $6700, $00AA, $E288, {} $6408, $32D8, $4A80, $6700, $009E, $E288, $6408, $22D8, $4A80, $6700, {} $0092, $E288, $640A, $22D8, $22D8, $4A80, 
$6700, $0084, $E288, $640C, {} $22D8, $22D8, $22D8, $22D8, $4A80, $6772, $E288, $6414, $22D8, $22D8, {} $22D8, $22D8, $22D8, $22D8, $22D8, $22D8, $4A80, $675A, $48E7, $7F3E, {} $E288, $641C, $4CD8, $0CFC, $48D1, $0CFC, $D2FC, $0020, $4CD8, $0CFC, {} $48D1, $0CFC, $D2FC, $0020, $4A80, $6732, $5380, $4CD8, $7CFE, $48D1, {} $7CFE, $D2FC, $0030, $4CD8, $7CFE, $48D1, $7CFE, $D2FC, $0030, $4CD8, {} $30FC, $48D1, $30FC, $D2FC, $0020, $51C8, $FFDA, $0480, $0001, $0000, {} $6ED0, $4CDF, $7CFE; {} End of Pascal INLINE source. - ------------------------- From: lim@iris.ucdavis.edu (Lloyd Lim) Subject: BlockMove (was Re: Fatest code to fill memory?) Date: 12 Feb 92 10:23:01 GMT Organization: U.C. Davis - Department of Electrical Engineering and Computer Science In article <D88-JWA.92Feb11182842@hemul.nada.kth.se> d88-jwa@hemul.nada.kth.se (Jon W{tte) writes: >.ch> neeri@iis.ethz.ch (Matthias Ulrich Neeracher) writes: > > For an interesting study in loop unrolling, take a look at Apple's > implementation of _BlockMove. > >True. Especially on 040 ROMs where they move a cache line each time :-) >(Couldn't youjust get the address of BlockMove and call that directly ? >That might be fast enough !) The original post was about filling instead of moving but since we're off the subject... :-) TN 261 says that BlockMove invalidates the cache for sizes larger than 12 bytes because you could be moving code. I haven't seen anyone mention this here. Does it only invalidate addresses in the destination or does it invalidate the whole thing? I'd think that if it trashes the whole thing and you're just moving data (which is probably 99.99% of the time, or 100% if you are well-behaved), it'd be faster to call your own routine to move it. Even simple routines would probably beat a complete loss of the cache. +++ Lloyd Lim Internet: lim@iris.cs.ucdavis.edu America Online: LimUnltd Compuserve: 72647,660 US Mail: 224 Lysle Leach Hall, U.C. 
Davis, Davis, CA 95616

- -------------------------

From: ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University)
Subject: Fastest code to fill memory?
Date: 12 Feb 92 10:30:23 +1300
Organization: University of Waikato, Hamilton, New Zealand

(Reading comments from people about the awkwardness of using DBEQ
in a fast loop...)

Pardon me if I'm pointing out something obvious, but you people *do*
realize there's a version of the DBcc instruction family that _doesn't_
test the condition codes, don't you?

Lawrence D'Oliveiro                       fone: +64-7-856-2889
Computer Services Dept                     fax: +64-7-838-4066
University of Waikato            electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand    37^ 47' 26" S, 175^ 19' 7" E, GMT+13:00

- -------------------------

From: taihou@iss.nus.sg (Tng Tai Hou)
Subject: Fast memory fill results!
Organization: Institute of Systems Science, NUS, Singapore
Date: Wed, 12 Feb 1992 11:12:00 GMT

I asked for help recently on the subject. I received more than 20
replies. Thanks to you folks, I have written two versions of what I
think is the fastest MemSet yet!!! Or at least, the fastest with my
current knowledge.

I believe I have not taken full advantage of the cache in the 68020,
030 and 040, or faster opcodes. Maybe someone can enlighten me.

/*
This is the version in 'C'. It first computes the remainder of count%4.
If zero, it performs longword memory fills. Else, it computes the nearest
longword boundary, fills that, and then fills the remainder (1, 2, or 3)
in bytes.
Note that value is a longword, and each of its 4 bytes contains the value
(in my case an 8-bit color value)
*/
void
MyAMemSet (register unsigned long *buf, register long value,
           register unsigned long count)
{
    register long rem = count%4;    /* rem = count & 0x00000003 */
    register unsigned long i;       /* same width as count; a 16-bit int
                                       would overflow on large fills */
    register unsigned char *p, val;

    if (rem == 0) {
        for (i=0; i<count; i+=4)
            *buf++ = value;
    }
    else {
        count &= 0xFFFFFFFC;        /* count = count/4*4 */
        for (i=0; i<count; i+=4)
            *buf++ = value;
        val = value & 0x000000FF;
        p = (unsigned char*)buf;
        for (i=0; i<rem; i++)
            *p++ = val;
    }
}

/*
This is a completely handcrafted version.
*/
void
MyAMemSet (/*register unsigned long *buf, register long value, register
unsigned long count*/)
{
    asm {
        move.l  4(sp), d0
        movea.l d0, a0          /* a0 = buf */
        move.l  8(sp), d1       /* d1 = value */
        move.l  12(sp), d2      /* d2 = count */

        move.l  d2, d3          /* d3 = rem */
        and.l   #0x00000003, d3
        bne.s   @else

@5:     move.l  d1, (a0)+       /* do *buf++ = value */
        subq.l  #4, d2
        bne.s   @5
        bra     @2

@else:  and.l   #0xfffffffc, d2
@1:     move.l  d1, (a0)+       /* do *buf++ = value */
        subq.l  #4, d2
        bne.s   @1

@3:     move.b  d1, (a0)+
        subq.l  #1, d3
        bne.s   @3

@2:
    }
}

I appreciate all comments and criticisms. Please post to the newsgroup
for everyone's benefit.

Thanks.

Tai Hou
Singapore

- -------------------------

From: ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University)
Subject: Fast memory fill results!
Date: 13 Feb 92 10:14:49 +1300
Organization: University of Waikato, Hamilton, New Zealand

In article <1992Feb12.111200.25672@nuscc.nus.sg>, taihou@iss.nus.sg
(Tng Tai Hou) offers a handcrafted memset routine, which I think
can be sped up a little more.

> /*
> This is a completely handcrafted version.
> */
> void
> MyAMemSet (/*register unsigned long *buf, register long value, register
> unsigned long count*/)
> {
>     asm {
>         move.l  4(sp), d0
>         movea.l d0, a0          /* a0 = buf */
>         move.l  8(sp), d1       /* d1 = value */
>         move.l  12(sp), d2      /* d2 = count */
>
>         move.l  d2, d3          /* d3 = rem */
>         and.l   #0x00000003, d3
>         bne.s   @else
>
> @5:     move.l  d1, (a0)+       /* do *buf++ = value */
>         subq.l  #4, d2
>         bne.s   @5
>         bra     @2

How about this as a replacement for the sequence from @5:

        lsr.l   #2, d2          /* d2 = count/4: dbra counts longwords,
                                   not bytes (same step as at @else) */
        bra.s   @59
@51:    swap    d2
@52:    move.l  d1, (a0)+
@59:    dbra    d2, @52
        swap    d2
        dbra    d2, @51
        bra     @2

>
> @else:  and.l   #0xfffffffc, d2
> @1:     move.l  d1, (a0)+       /* do *buf++ = value */
>         subq.l  #4, d2
>         bne.s   @1

How about:

@else:  lsr.l   #2, d2
        bra.s   @19
@11:    swap    d2
@12:    move.l  d1, (a0)+
@19:    dbra    d2, @12
        swap    d2
        dbra    d2, @11

>
> @3:     move.b  d1, (a0)+
>         subq.l  #1, d3
>         bne.s   @3

This loop will never iterate more than four times, so it's probably
not worth speeding up.

>
> @2:
>     }
> }

Lawrence D'Oliveiro
One-trick asm pony

- -------------------------

From: twillis@ec.ecn.purdue.edu (Thomas E Willis)
Subject: Fatest code to fill memory?
Organization: Electrical Engineering, Purdue University
Date: Thu, 13 Feb 1992 15:00:36 GMT

In article <1992Feb11.180644.21941@ima.isc.com> suitti@ima.isc.com
(Stephen Uitti) writes:

[tons of stuff deleted]

>zero". In fact, given that "dbeq" is so complicated, I wouldn't
>be surprised if "subq.l"/"bne.s" weren't faster, or at least
>nearly the same speed as "dbeq". I'd say, off hand, that "dbeq"
>is useless.

if i remember my 68k times right, db<cc> is faster than a subq/bne pair.
to get around the testing for equal problem, use "dbf" (which always
takes the branch unless the register decrements to -1). you do have to
"bias" your loop counter since you're looping until the index register
hits -1.
for example:

        moveq   #9,d0           ; causes us to do the move thing 10 times
@1:     move.l  d1, (a0)+
        dbf     d0, @1

this should also work if d1 happens to be 0 (which would cause the dbeq
code to fall through on the first iteration).

dbeq is useful in other situations when you want to get out of the loop
on equal (moving bytes until the first zero?), but it has some "features"
that make it not quite right for blasting bits around. kinda the old
"right tool for the job" type question.

just my $0.02 worth...

--
- t
- -------------------------------------------------------------------------
Tom Willis              /  "These are dangerous days, to say what you feel
Purdue Electrical Engr. /   is to dig your own grave." - Sinead O'Connor,
twillis@ecn.purdue.edu  /  "Black Boys on Mopeds"

- -------------------------

From: ahbritto@iat.com (Arthur H. Britto)
Subject: Fatest code to fill memory?
Date: 13 Feb 92 03:02:57 GMT
Organization: Information Access Technologies, Inc. of Berkeley, CA

jesjones@milton.u.washington.edu (Jesse Jones) writes:

>     Chris Tate is right: assembly code is definitely the best way to go
>if you want your graphic routines to run as fast as possible. The code
>he has is fine as long as you remember that the decrement and branch
>instructions are restricted to word length counter registers.

Faster than the MOVE.L is the MOVEM instruction. Save off all the
registers you can and then use the MOVEM instruction.

Of course, you should unwrap your loops for maximum effect. Better yet,
have no loops and calculate a branch into the code. This is how
Armor Alley does it.

Arthur Britto
--
- ----------------------------------------------------------------------------
Information Access Technologies, Inc.    Internet:  ahbritto@iat.com
46 Shattuck Square, Suite 11             Applelink: ahbritto@iat.com@internet#
Berkeley, CA 94704-1152                  Voice: 510-704-0160  Fax: 510-704-8019

- -------------------------

From: deadman@garnet.berkeley.edu (Ben Haller)
Subject: Fatest code to fill memory?
Date: 15 Feb 92 00:02:58 GMT
Organization: Stick Software

In article <1992Feb13.030257.13145@iat.com> ahbritto@iat.com
(Arthur H. Britto) writes:
>Of course, you should unwrap your loops for maximum effect.

This is a common fallacy. In fact, on any Mac upwards of the
68000-based ones, unrolling loops kills your performance unless you
are very careful. You need to make sure that your loop will fit inside
the cache size of the chip you're running on (different chips have
different sizes, too). In general if you're doing memory accesses
(filling memory with a value) the instructions will pipeline such that
the loop branch instruction effectively takes a negligible amount of
time anyway. Unrolling may speed things up, but beware - if you unroll
it too far, you'll start thrashing the instruction cache, and your
performance will go down the tubes.

-Ben Haller (deadman@garnet.berkeley.edu)

---------------------------

End of C.S.M.P. Digest
**********************
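
As a recap of the fill thread above (longword stores for the bulk,
byte stores for the odd bytes, and only a modest amount of unrolling),
here is a minimal sketch in portable C. It is not code from any poster:
the name fill_bytes and its signature are assumptions made for
illustration, and unlike the posted MemSet it takes a single byte value
and replicates it itself. It also aligns the pointer first, which the
posted versions leave to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only; the name and signature
   are assumptions, not code from this digest.
   Fills count bytes starting at buf with value. */
void fill_bytes(void *buf, unsigned char value, size_t count)
{
    unsigned char *p = buf;
    /* Replicate the byte into all four lanes of a longword. */
    uint32_t pattern = (uint32_t)value * 0x01010101u;

    assert(buf != NULL || count == 0);

    /* Head: byte stores until the pointer is longword-aligned. */
    while (count > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = value;
        count--;
    }

    /* Bulk: four longword stores per pass. The unroll is kept small
       on purpose so the loop body stays compact. A 4-byte memcpy is
       the portable way to spell a longword store. */
    while (count >= 16) {
        memcpy(p,      &pattern, 4);
        memcpy(p + 4,  &pattern, 4);
        memcpy(p + 8,  &pattern, 4);
        memcpy(p + 12, &pattern, 4);
        p += 16;
        count -= 16;
    }

    /* Remaining whole longwords (0 to 3 of them). */
    while (count >= 4) {
        memcpy(p, &pattern, 4);
        p += 4;
        count -= 4;
    }

    /* Tail: the last 0 to 3 bytes. */
    while (count > 0) {
        *p++ = value;
        count--;
    }
}
```

A call such as fill_bytes(screen, 0xAB, 37) handles any length and any
starting alignment, including lengths under four bytes. Whether the
four-way unroll helps or hurts on a given CPU is exactly the question
the last post raises, so measure before unrolling further.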